To facilitate research on text generation, this paper presents a comprehensive and unified library, TextBox 2.0, focusing on the use of pre-trained language models (PLMs). To be comprehensive, our library covers $13$ common text generation tasks and their corresponding $83$ datasets and further incorporates $45$ PLMs covering general, translation, Chinese, dialogue, controllable, distilled, prompting, and lightweight PLMs. We also implement $4$ efficient training strategies and provide $4$ generation objectives for pre-training new PLMs from scratch. To be unified, we design the interfaces to support the entire research pipeline (from data loading to training and evaluation), ensuring that each step can be fulfilled in a unified way. Despite the rich functionality, it is easy to use our library, either through the friendly Python API or command line. To validate the effectiveness of our library, we conduct extensive experiments and exemplify four types of research scenarios. The project is released at the link: https://github.com/RUCAIBox/TextBox.
Deformable image registration, i.e., the task of aligning multiple images into one coordinate system by non-linear transformation, serves as an essential preprocessing step for neuroimaging data. Recent research on deformable image registration mainly focuses on improving registration accuracy with multi-stage alignment methods, where the source image is repeatedly deformed in stages by the same neural network until it is well aligned with the target image. Conventional methods for multi-stage registration often blur the source image, because the pixel/voxel values are repeatedly interpolated from the image generated by the previous stage. However, maintaining image quality, such as sharpness, during image registration is crucial to medical data analysis. In this paper, we study the problem of anti-blur deformable image registration and propose a novel solution, called Anti-Blur Network (ABN), for multi-stage image registration. Specifically, we use a pair of short-term registration and long-term memory networks to learn the nonlinear deformations at each stage: the short-term registration network learns how to improve the registration accuracy incrementally, and the long-term memory network combines all the previous deformations so that interpolation can be performed directly on the raw image, preserving image sharpness. Extensive experiments on both natural and medical image datasets demonstrate that ABN can accurately register images while preserving their sharpness. Our code and data can be found at https://github.com/anonymous3214/ABN
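The idea of combining the per-stage deformations and interpolating the raw image only once can be illustrated with a minimal 1-D sketch. It is not the ABN implementation: the helper names `warp` and `compose`, the constant half-pixel displacements, and the use of NumPy linear interpolation are all assumptions chosen for illustration.

```python
import numpy as np


def warp(signal: np.ndarray, disp: np.ndarray) -> np.ndarray:
    """Resample `signal` at positions x + disp(x) using linear interpolation."""
    x = np.arange(signal.size, dtype=float)
    return np.interp(x + disp, x, signal)


def compose(disp_prev: np.ndarray, disp_new: np.ndarray) -> np.ndarray:
    """Displacement equivalent to warping by `disp_prev` first, then `disp_new`.

    If I1(x) = I0(x + u1(x)) and I2(x) = I1(x + u2(x)), then
    I2(x) = I0(x + u2(x) + u1(x + u2(x))).
    """
    x = np.arange(disp_prev.size, dtype=float)
    return disp_new + np.interp(x + disp_new, x, disp_prev)


# Toy data (hypothetical): a sharp step edge and four half-pixel stages.
signal = np.zeros(32)
signal[16:] = 1.0
stages = [np.full(32, 0.5) for _ in range(4)]

sequential = signal.copy()   # re-interpolated at every stage
total = np.zeros(32)         # accumulated displacement
for disp in stages:
    sequential = warp(sequential, disp)
    total = compose(total, disp)

direct = warp(signal, total)  # single interpolation of the raw signal
```

In this toy setting, `sequential` contains intermediate values around the edge because every stage interpolates the previous output, whereas `direct` stays binary because the raw signal is interpolated once with the composed displacement.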
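Current efficient LiDAR-based detection frameworks fall short in exploiting object relations, which naturally exist in both spatial and temporal forms. To this end, we introduce a simple, efficient, and effective two-stage detector, termed Ret3D. At the core of Ret3D are novel intra-frame and inter-frame relation modules that capture spatial and temporal relations, respectively. More specifically, the intra-frame relation module (IntraRM) encapsulates the objects within a frame into a sparse graph, allowing us to refine object features through efficient message passing. The inter-frame relation module (InterRM), on the other hand, dynamically and densely connects each object within its corresponding tracked sequence, and leverages this temporal information to further enhance its representation efficiently through a lightweight transformer network. We instantiate our novel designs of IntraRM and InterRM with general center-based or anchor-based detectors and evaluate them on the Waymo Open Dataset (WOD). With negligible extra overhead, Ret3D achieves state-of-the-art performance, outperforming the most recent competitor on vehicle detection by 5.5% and 3.2% in terms of the LEVEL 1 and LEVEL 2 mAPH metrics, respectively.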
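With the rapid development of mobile devices, modern widely used mobile phones typically allow users to capture 4K-resolution (i.e., ultra-high-definition) images. However, for image demoireing, a challenging task in low-level vision, existing works are generally carried out on low-resolution or synthetic images. Hence, the effectiveness of these methods on 4K-resolution images remains unknown. In this paper, we explore moire pattern removal for ultra-high-definition images. To this end, we propose the first ultra-high-definition demoireing dataset (UHDM), which contains 5,000 real-world 4K-resolution image pairs, and conduct a benchmark study on current state-of-the-art methods. Furthermore, we present an efficient baseline model, ESDNet, for tackling 4K moire images, in which we build a semantic-aligned scale-aware module to address the scale variation of moire patterns. Extensive experiments demonstrate the effectiveness of our approach, which outperforms state-of-the-art methods while being much more lightweight. Code and dataset are available at https://xinyu-andy.github.io/uhdm-page.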
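Most real-world problems that machine learning algorithms are expected to solve involve 1) an unknown data distribution; 2) little domain-specific knowledge; and 3) datasets with limited annotation. We propose Non-Parametric learning by Compression with Latent Variables (NPC-LV), a learning framework for any dataset with abundant unlabeled data but very few labeled samples. By training only a generative model in an unsupervised way, the framework uses the data distribution to build a compressor. Using a compressor-based distance metric derived from Kolmogorov complexity, together with a few labeled samples, NPC-LV classifies without further training. We show that NPC-LV outperforms supervised methods on all three image classification datasets in the low-data regime, and even outperforms semi-supervised learning methods on CIFAR-10. We demonstrate how and when the negative evidence lower bound (nELBO) can be used as an approximate compressed length for classification. By revealing the correlation between compression rate and classification accuracy, we illustrate that, under NPC-LV, improvements in generative models can enhance downstream classification accuracy.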
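One well-known compressor-based distance of this kind is the normalized compression distance (NCD), a computable approximation of Kolmogorov-complexity-based distances. The sketch below is not NPC-LV itself: it swaps in gzip as the compressor, whereas the abstract describes a compressor built from a trained generative model with nELBO as the approximate compressed length. The function names and the toy byte strings are hypothetical.

```python
import gzip
from collections import Counter
from typing import List, Tuple


def compressed_len(data: bytes) -> int:
    """Length in bytes of the gzip-compressed input, standing in for C(x)."""
    return len(gzip.compress(data))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance: (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    cx, cy, cxy = compressed_len(x), compressed_len(y), compressed_len(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def knn_classify(query: bytes, labeled: List[Tuple[bytes, str]], k: int = 1) -> str:
    """Label the query by majority vote among its k nearest labeled samples."""
    nearest = sorted(labeled, key=lambda item: ncd(query, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy labeled set (hypothetical): one highly repetitive sample, one byte ramp.
labeled = [(b"ab" * 200, "periodic"), (bytes(range(256)), "ramp")]
prediction = knn_classify(b"ab" * 180, labeled, k=1)
```

No training happens at classification time: the only learned component is the compressor, and the few labeled samples act purely as reference points for the nearest-neighbor vote.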
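Liver cancer is one of the most common malignant diseases in the world. Segmentation and labeling of liver tumors and blood vessels in CT images can assist doctors in liver tumor diagnosis and surgical intervention. In the past decades, automatic CT segmentation methods based on deep learning have received widespread attention in the medical field, and many state-of-the-art segmentation algorithms have appeared during this period. However, most existing segmentation methods consider only the local feature context and are deficient in perceiving the global relevance of medical images, which significantly affects the segmentation of liver tumors and blood vessels. We introduce a multi-scale feature context fusion network, called TransFusionNet, based on Transformer and SEBottleNet. This network can accurately detect and identify the details of the region of interest of the liver vessels, while also improving the recognition of the morphological margins of liver tumors by exploiting the global information of CT images. Experiments show that TransFusionNet outperforms state-of-the-art methods on the public datasets LiTS and 3Dircadb as well as on our clinical dataset. Finally, we propose an automatic three-dimensional reconstruction algorithm based on the trained model. The algorithm can complete the reconstruction quickly and accurately within 1 second.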
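Existing RGB-D saliency detection models do not explicitly encourage RGB and depth to achieve effective multi-modal learning. In this paper, we introduce a novel multi-stage cascaded learning framework via mutual information minimization to "explicitly" model the multi-modal information between RGB images and depth data. Specifically, we first map the features of each modality to a lower-dimensional feature vector and adopt mutual information minimization as a regularizer to reduce the redundancy between appearance features from RGB and geometric features from depth. We then perform multi-stage cascaded learning, imposing the mutual information minimization constraint at every stage of the network. Extensive experiments on benchmark RGB-D saliency datasets illustrate the effectiveness of our framework. Furthermore, to foster the development of this field, we contribute the largest dataset (7 times larger than NJU2K), which contains 15,625 image pairs with high-quality polygon-/scribble-/object-/instance-/rank-level annotations. Based on these rich labels, we additionally construct four new benchmarks with strong baselines and observe some interesting phenomena that can motivate future model design. Source code and dataset are available at https://github.com/jingzhang617/cascaded_rgbd_sod.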
Deep learning on graph structures has shown exciting results in various applications. However, little attention has been paid to the robustness of such models, in contrast to the extensive research on adversarial attack and defense for images and text. In this paper, we focus on adversarial attacks that fool the model by modifying the combinatorial structure of the data. We first propose a reinforcement learning based attack method that learns a generalizable attack policy while requiring only prediction labels from the target classifier. We also present variants of genetic algorithms and gradient methods for the scenario where prediction confidence or gradients are available. We use both synthetic and real-world data to show that a family of Graph Neural Network models is vulnerable to these attacks in both graph-level and node-level classification tasks. We also show that such attacks can be used to diagnose the learned classifiers.
Different people speak with diverse personalized speaking styles. Although existing one-shot talking head methods have made significant progress in lip sync, natural facial expressions, and stable head motions, they still cannot generate diverse speaking styles in the final talking head videos. To tackle this problem, we propose a one-shot style-controllable talking face generation framework. In a nutshell, we aim to attain a speaking style from an arbitrary reference speaking video and then drive the one-shot portrait to speak with the reference speaking style and another piece of audio. Specifically, we first develop a style encoder to extract dynamic facial motion patterns of a style reference video and then encode them into a style code. Afterward, we introduce a style-controllable decoder to synthesize stylized facial animations from the speech content and style code. In order to integrate the reference speaking style into generated videos, we design a style-aware adaptive transformer, which enables the encoded style code to adjust the weights of the feed-forward layers accordingly. Thanks to the style-aware adaptation mechanism, the reference speaking style can be better embedded into synthesized videos during decoding. Extensive experiments demonstrate that our method is capable of generating talking head videos with diverse speaking styles from only one portrait image and an audio clip while achieving authentic visual effects. Project Page: https://github.com/FuxiVirtualHuman/styletalk.
When using LiDAR semantic segmentation models for safety-critical applications such as autonomous driving, it is essential to understand and improve their robustness with respect to a large range of LiDAR corruptions. In this paper, we aim to comprehensively analyze the robustness of LiDAR semantic segmentation models under various corruptions. To rigorously evaluate the robustness and generalizability of current approaches, we propose a new benchmark called SemanticKITTI-C, which features 16 out-of-domain LiDAR corruptions in three groups, namely adverse weather, measurement noise, and cross-device discrepancy. We then systematically investigate 11 LiDAR semantic segmentation models, spanning different input representations (e.g., point clouds, voxels, projected images, etc.), network architectures, and training schemes. Through this study, we obtain two insights: 1) The input representation plays a crucial role in robustness; under specific corruptions, different representations perform differently. 2) Although state-of-the-art methods for LiDAR semantic segmentation achieve promising results on clean data, they are less robust when dealing with noisy data. Finally, based on the above observations, we design a robust LiDAR segmentation model (RLSeg) that greatly boosts robustness with simple but effective modifications. We hope that our benchmark, comprehensive analysis, and observations can boost future research in robust LiDAR semantic segmentation for safety-critical applications.
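As a rough illustration of what a measurement-noise style corruption might look like, the sketch below perturbs a point cloud with Gaussian jitter and random point dropout. These two operations, the function names, and the parameter values are assumptions for illustration only; the abstract does not specify how the 16 SemanticKITTI-C corruptions are implemented.

```python
import numpy as np


def jitter_points(points: np.ndarray, sigma: float, seed: int = 0) -> np.ndarray:
    """Add zero-mean Gaussian noise with standard deviation `sigma` to every coordinate."""
    rng = np.random.default_rng(seed)
    return points + rng.normal(0.0, sigma, size=points.shape)


def drop_points(points: np.ndarray, keep_ratio: float, seed: int = 0) -> np.ndarray:
    """Keep a random subset of points, preserving their original order."""
    rng = np.random.default_rng(seed)
    n_keep = int(round(keep_ratio * len(points)))
    idx = rng.choice(len(points), size=n_keep, replace=False)
    return points[np.sort(idx)]


# Toy point cloud (hypothetical): 100 points with x, y, z coordinates.
cloud = np.zeros((100, 3))
noisy = jitter_points(cloud, sigma=0.05)
sparse = drop_points(cloud, keep_ratio=0.25)
```

A robustness evaluation along these lines would run a trained segmentation model on both the clean and the corrupted clouds and compare the resulting scores.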